Chapter 1: Introduction

Customer churn, also known as customer turnover, occurs when a customer chooses to leave or unsubscribe from a service for any reason. In the context of this project, we look at customer churn in banking. Being able to predict customer churn is important because banks want to retain as many customers as they can (Kaemingk, 2018).

Retention is not only desirable in itself: it is also cheaper to retain existing customers than to spend money acquiring new customers to replace the ones that left, and lower customer acquisition costs mean higher profits. By identifying customers at risk of churning, a bank can target retention efforts at them to maximize their likelihood of staying (Guliyev et al., 2021).

We chose to analyze a multinational bank’s churn for several reasons. As an international team, we were interested in topics that spanned multiple countries, with a particular focus on business or the environment, our personal interests. When we pitched topics, nearly every suggestion was geared toward sales and customer experience or toward global environmental issues.

The aim of this project is to identify the most useful model for predicting churn across three countries serviced by a multinational bank.

Customer Analysis Overview

From our midterm EDA, we found that more than 50% of the customers are based in France; the average customer is about 39 years old and has been in a relationship with the bank for more than a year; 51% of customers are active; male customers outnumber female customers; 71% of customers use a credit card; and the average credit score is 650. We also found that 20% of the customers have churned, so we decided to determine which factors significantly affect customer churn in this multinational bank. The model identified can then inform banks which customer attributes and activity trends should be monitored, so the bank can make retention efforts before an account is closed.

Chapter 2: Description of the Data

This study is based on a data set, obtained from Kaggle (Topre, 2022), that contains 10,000 observations of bank customers with 12 variables. (See below for a readout of the dataset’s structure and variable names.) Variable descriptions are as follows:

## 'data.frame':    10000 obs. of  12 variables:
##  $ customer_id     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : int  1 0 1 0 0 1 0 1 0 0 ...
  1. Customer ID - The unique ID of each individual customer
  2. Credit Score - A number depicting the customer’s creditworthiness
  3. Country - The country the customer banks from
  4. Gender - The gender the customer identifies with
  5. Age - The customer’s age
  6. Tenure - The length in years of the customer’s relationship with the bank
  7. Balance - The amount currently available in the customer’s account
  8. Products Number - The number of products purchased by the customer through the bank
  9. Credit Card - Indicates whether the customer has a credit card
  10. Active Member - Indicates whether the customer is an active or inactive member
  11. Estimated Salary - The bank’s estimate of the customer’s income
  12. Churn - Indicates whether the customer has left the bank

Chapter 3: Independent Variables EDA Recap

Cleaning the Data

In preparation for exploratory data analysis, we took several steps to clean the data. We immediately dropped the customer_id variable, as we do not need the account holder’s unique identifier for our purpose. After dropping customer_id our dataset had 11 variables:

## 'data.frame':    10000 obs. of  11 variables:
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : int  1 0 1 0 0 1 0 1 0 0 ...

Next, we checked the data set for duplicate records and null values; none were found. We then converted the following variables into categorical variables: credit_card, active_member, churn, gender, tenure, products_number, and age. We also converted the Boolean values (0, 1) into character labels for the credit_card (Credit Card / No-Credit Card), active_member (Active / In Active), and churn (Churned / Retained) variables, so that each variable reads clearly during plotting and analysis.

credit_card     active_member  churn
Credit Card     Active         Churned
No-Credit Card  Active         Retained
Credit Card     In Active      Churned
No-Credit Card  In Active      Retained
Credit Card     Active         Retained
Credit Card     In Active      Churned

Finally, we checked our continuous variables for outliers using the outlierKD function. The first output below corresponds to credit_score and the second to age.

## Outliers identified: 15 
## Proportion (%) of outliers: 0.2 
## Mean of the outliers: 361 
## Mean without removing outliers: 651 
## Mean if we remove outliers: 651 
## Outliers successfully removed

## Outliers identified: 359 
## Proportion (%) of outliers: 3.7 
## Mean of the outliers: 69.3 
## Mean without removing outliers: 38.9 
## Mean if we remove outliers: 37.8 
## Outliers successfully removed

From the outlierKD output we observe that outliers were found only in the credit_score and age variables (credit_score: 0.2% of values, age: 3.7%), so we removed these outliers using the outlierKD function.
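outlierKD is a community-shared helper rather than a base-R function; an equivalent IQR-based sketch (shown here for the age column, with the data frame name churn_data assumed for illustration) is:

```r
# IQR-based outlier removal, mirroring what the outlierKD output reports.
remove_outliers <- function(x) {
  qs  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  # Values beyond 1.5 * IQR from the quartiles are treated as outliers
  # and replaced with NA.
  x[x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr] <- NA
  x
}

churn_data$age <- remove_outliers(churn_data$age)
```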

Customer Analysis Review

In this section we analyze each variable in the bank churn dataset using plots, along with the calculated mean, standard deviation, and percentages for each variable.

What is the average credit score of the customers?

The average credit score of the customers is 650.529, with a standard deviation of 96.953; most customers have a credit score between 600 and 700. The histogram below shows the range of credit scores.

Which are the different countries in which a customer holds a bank account?

The customers are grouped by the countries in which they hold their accounts. As the plot shows, France has more than 50% of customer accounts, the highest of the three countries, with Germany and Spain sharing roughly equal percentages. The bar plot below shows the number of customers in each of the three countries.

What is the average age of the customer?

The majority of the bank’s customers are below the age of 50, with an average age of 39 and a standard deviation of 10.5.

How long have customers had a relationship with the bank?

Most of the customers have been with the bank for more than a year.

What is the percentage of Male and Female customers who hold an account in the bank?

The bank’s customers are predominantly male: males make up 55% of the customer base, with females making up the remaining 45%.

What is the percentage of Active and Inactive account holders in the bank?

48.5% of customers are inactive.

What types of products/services provided by the bank do customers purchase?

Most customers (about 50%) hold a single product, while four products is the least common, at 0.6%.

What percentage of customers make use of credit cards?

71% of the bank’s customers use a credit card; only 29% do not.

Currently what percentage of customers are churned from the bank?

The bank managed to retain 80% of its customers, while the remaining 20% churned.

So the big question was: which factors drove the 20% churn rate in the multinational bank across its different countries? We used chi-squared tests, two-sample t-tests, and correlation plots to find which factors influenced customer churn:

Variables’ Influence on Churn

## 
##  Pearson's product-moment correlation
## 
## data:  churn_data$balance and as.numeric(churn_data$churn)
## t = 12, df = 9998, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0992 0.1378
## sample estimates:
##   cor 
## 0.119

From the correlation plot we observe that age is the variable most strongly correlated with churn. After pairing churn with different columns and computing the correlation score for each, we see that the values decrease from 0.285:

## [1] 0.285
## [1] 0.272
## [1] 0.283
## [1] 0.119
## [1] 0.269
## [1] 0.119

Chapter 4: Handling Imbalanced Data

In our EDA, we found that our binary target variable, churn, was imbalanced, with 7,963 instances of 0/Retained customers and 2,037 instances of 1/Churned customers.

With 79.6% of customers 0/Retained and 20.4% of customers 1/Churned, we needed to balance our dataset and bring this split closer to 50/50 for the purpose of our training set. We used the ovun.sample function from the ROSE package to undersample the majority class, resulting in a more even split. With the new proportions, we are ready to train-test split and proceed:
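The undersampling step can be sketched as follows, assuming the cleaned data frame is named churn_data and the target column is churn (0/1):

```r
# Undersample the majority class with ROSE's ovun.sample.
library(ROSE)

# method = "under" drops majority-class (retained) rows until the classes
# are balanced; p = 0.5 asks for roughly a 50/50 split.
balanced <- ovun.sample(churn ~ ., data = churn_data,
                        method = "under", p = 0.5, seed = 123)$data

# Check the new class proportions -- they should be close to 50/50.
prop.table(table(balanced$churn))
```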

Chapter 5: SMART QUESTIONS

Before we work on PCA, we need to ensure that the data is scaled and that all variables are of numeric or integer type.

## 'data.frame':    10000 obs. of  11 variables:
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : chr  "Credit Card" "No-Credit Card" "Credit Card" "No-Credit Card" ...
##  $ active_member   : chr  "Active" "Active" "In Active" "In Active" ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : chr  "1" "0" "1" "0" ...

The goal of PCA is to perform dimensionality reduction while preserving as much of the variance in the high-dimensional space as possible.

Now we will identify correlations in the data.

## 'data.frame':    10000 obs. of  11 variables:
##  $ credit_score    : num  608 699 850 822 684 497 476 549 616 549 ...
##  $ country         : num  0 2 0 0 2 2 0 1 0 0 ...
##  $ gender          : num  1 1 1 1 1 0 0 1 0 0 ...
##  $ age             : int  41 39 43 50 27 24 34 25 45 24 ...
##  $ tenure          : int  1 1 2 7 2 3 10 5 3 9 ...
##  $ balance         : num  83808 0 125511 0 134604 ...
##  $ products_number : int  1 2 1 2 1 2 2 2 2 2 ...
##  $ credit_card     : num  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : num  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  112543 93827 79084 10063 71726 ...
##  $ churn           : num  0 0 0 0 0 0 0 0 0 0 ...
## 'data.frame':    10000 obs. of  10 variables:
##  $ credit_score    : num  608 699 850 822 684 497 476 549 616 549 ...
##  $ country         : num  0 2 0 0 2 2 0 1 0 0 ...
##  $ gender          : num  1 1 1 1 1 0 0 1 0 0 ...
##  $ age             : int  41 39 43 50 27 24 34 25 45 24 ...
##  $ tenure          : int  1 1 2 7 2 3 10 5 3 9 ...
##  $ balance         : num  83808 0 125511 0 134604 ...
##  $ products_number : int  1 2 1 2 1 2 2 2 2 2 ...
##  $ credit_card     : num  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : num  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  112543 93827 79084 10063 71726 ...
Correlation matrix of the predictor variables:
credit_score country gender age tenure balance products_number credit_card active_member estimated_salary
credit_score 1.0000 0.0049 -0.0049 0.0062 0.0109 0.0076 -0.0191 -0.0013 -0.0069 -0.0142
country 0.0049 1.0000 -0.0047 0.0061 0.0026 -0.0035 -0.0082 -0.0085 0.0067 0.0039
gender -0.0049 -0.0047 1.0000 -0.0027 -0.0076 -0.0049 -0.0026 -0.0058 -0.0225 -0.0047
age 0.0062 0.0061 -0.0027 1.0000 -0.0160 0.0649 -0.0511 -0.0112 -0.0092 -0.0031
tenure 0.0109 0.0026 -0.0076 -0.0160 1.0000 0.0199 0.0139 -0.0085 -0.0067 0.0188
balance 0.0076 -0.0035 -0.0049 0.0649 0.0199 1.0000 -0.2042 -0.0047 0.0045 0.0007
products_number -0.0191 -0.0082 -0.0026 -0.0511 0.0139 -0.2042 1.0000 0.0160 -0.0002 0.0139
credit_card -0.0013 -0.0085 -0.0058 -0.0112 -0.0085 -0.0047 0.0160 1.0000 -0.0119 0.0003
active_member -0.0069 0.0067 -0.0225 -0.0092 -0.0067 0.0045 -0.0002 -0.0119 1.0000 0.0054
estimated_salary -0.0142 0.0039 -0.0047 -0.0031 0.0188 0.0007 0.0139 0.0003 0.0054 1.0000
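The correlation matrix above can be reproduced with a short sketch, assuming churn_data holds the cleaned data and country/gender are the character columns (the exact encoding used in the project may differ):

```r
# Encode the character columns numerically before computing correlations.
num_data <- churn_data
num_data$country <- as.numeric(factor(num_data$country)) - 1  # France=0, Germany=1, Spain=2
num_data$gender  <- as.numeric(factor(num_data$gender)) - 1   # Female=0, Male=1

# Drop the target before computing pairwise correlations among predictors.
cor_mat <- cor(num_data[, setdiff(names(num_data), "churn")])
round(cor_mat, 4)
```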

Linear Regression

Does the combination of the customer demographic variables country, age, and gender influence the churn rate of the customers?

## 
## Call:
## lm(formula = churn ~ country + age + gender, data = churn_data_pc_target)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.256 -0.385 -0.227  0.478  0.925 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.191973   0.019323   -9.93   <2e-16 ***
## country     -0.005029   0.005644   -0.89     0.37    
## age          0.015744   0.000436   36.10   <2e-16 ***
## gender      -0.011436   0.009380   -1.22     0.22    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.467 on 9996 degrees of freedom
## Multiple R-squared:  0.116,  Adjusted R-squared:  0.115 
## F-statistic:  435 on 3 and 9996 DF,  p-value: <2e-16

From the results of this linear regression model, only age is statistically significant (p < 2e-16); country and gender are not (p = 0.37 and p = 0.22). Age affects the churn rate positively, while the country and gender coefficients are negative but not significant.

Do the number of products, estimated salary, account balance, and account activity status affect the churn rate?

## 
## Call:
## lm(formula = churn ~ products_number + estimated_salary + balance + 
##     active_member, data = churn_data_pc_target)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.633 -0.465 -0.328  0.515  0.753 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.80e-01   1.81e-02   20.99  < 2e-16 ***
## products_number  -3.41e-02   7.68e-03   -4.44  9.1e-06 ***
## estimated_salary  1.38e-07   8.58e-08    1.61     0.11    
## balance           1.13e-06   8.13e-08   13.85  < 2e-16 ***
## active_member     1.39e-02   9.81e-03    1.41     0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 9995 degrees of freedom
## Multiple R-squared:  0.0246, Adjusted R-squared:  0.0242 
## F-statistic: 62.9 on 4 and 9995 DF,  p-value: <2e-16

From the results of this linear regression model, products_number and balance are highly significant: the number of products affects churn negatively, while the account balance affects it positively. Estimated salary and active-member status are not significant (p = 0.11 and p = 0.16).

Finding the Best Model

What are the principal components for predicting churn rate?

## [1] "Case: z-score/scaled"
## Importance of components:
##                          PC1   PC2   PC3   PC4   PC5    PC6   PC7    PC8    PC9
## Standard deviation     1.112 1.017 1.009 1.006 1.004 0.9975 0.990 0.9859 0.9757
## Proportion of Variance 0.124 0.103 0.102 0.101 0.101 0.0995 0.098 0.0972 0.0952
## Cumulative Proportion  0.124 0.227 0.329 0.430 0.531 0.6305 0.729 0.8257 0.9209
##                          PC10
## Standard deviation     0.8892
## Proportion of Variance 0.0791
## Cumulative Proportion  1.0000
##                       PC1     PC2      PC3      PC4     PC5     PC6    PC7
## credit_score      0.08686 -0.0497  0.26077 -0.72578  0.2115 -0.0281  0.110
## country           0.02565  0.2682 -0.12200 -0.36430 -0.3858  0.7019 -0.337
## gender           -0.01024 -0.4806  0.23490  0.16361 -0.5333 -0.0996 -0.419
## age               0.33121 -0.1171 -0.13614  0.01863 -0.1288  0.2761  0.618
## tenure           -0.00162  0.4256  0.68282 -0.08285  0.0277 -0.1706 -0.107
## balance           0.66322  0.0683  0.08162  0.12278  0.0842 -0.0486 -0.147
## products_number  -0.65915  0.0158  0.00953 -0.03794 -0.0293  0.0428  0.194
## credit_card      -0.07459 -0.2234  0.06652  0.21656  0.6926  0.4630 -0.337
## active_member     0.00213  0.4912 -0.53758 -0.00862  0.0642 -0.3091 -0.301
## estimated_salary -0.04565  0.4513  0.27747  0.49301 -0.1136  0.2778  0.204
##                      PC8     PC9    PC10
## credit_score     -0.5781 -0.0409  0.0415
## country           0.1486 -0.0273  0.0411
## gender           -0.3499  0.3077  0.0176
## age              -0.0162  0.6149 -0.0648
## tenure            0.3348  0.4220 -0.1192
## balance           0.0370 -0.0671  0.7048
## products_number   0.0100  0.2143  0.6910
## credit_card      -0.0897  0.2749 -0.0439
## active_member    -0.3233  0.4178 -0.0209
## estimated_salary -0.5430 -0.2150 -0.0354

Correlation can vary from -1 to +1: a value close to +1 indicates a high degree of positive correlation, a value close to -1 indicates a high degree of negative correlation, and a value close to 0 indicates no correlation at all.

## [1] 7034
## 
##     0     1 
## 0.795 0.205

The biplot between PC1 and PC2 shows, first, the data points projected onto the PC1-PC2 plane, with the scales on the bottom and left tick marks. It also shows the components on this plane, with the scales on the top and right tick marks giving the loadings.

PC1 has more variation than PC2.

We also plotted the graphs between PC2-PC3 and PC3-PC4: PC2 has more variation than PC3, and PC3 more than PC4.

Here, no single component dominates; the variance is spread fairly evenly across the components.

From the proportion-of-variance graph we can see that roughly 90% of the variance is explained by the first 9 principal components.
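A minimal prcomp sketch consistent with the z-score/scaled case above (the all-numeric frame without the churn column is assumed to be named num_data):

```r
# PCA on z-scored data: center and scale each variable before rotation.
pc <- prcomp(num_data, center = TRUE, scale. = TRUE)

# Proportion of variance per component and the cumulative proportion,
# used to decide how many components to keep (~9 for ~90% here).
summary(pc)

# Loadings (rotation matrix): how each original variable contributes
# to each principal component.
pc$rotation
```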

Feature Selection

Which model gives the best results based on the adjusted R^2 value, along with lower BIC and Cp?

Feature selection using Exhaustive Search

Using exhaustive search, age and balance were selected: a two-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_score, age, balance: a three-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary: a five-variable model with a Cp value of 7.

Mallow’s Cp Plot

The Mallow’s Cp plot selected two best models:

1) credit_score, country, gender, age, tenure, balance, active_member, estimated_salary (8-variable model)

2) credit_score, country, gender, age, tenure, balance, estimated_salary (7-variable model)

Feature selection using forward search

The best model selected using forward search is age, balance: a two-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_card, age, balance: a three-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary: a five-variable model with a Cp value of 7.

Feature selection using backward search

The best model selected using backward search is age, balance: a two-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_card, age, balance: a three-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary: a five-variable model with a Cp value of 7.

Feature selection using Sequential Replacement

The best model selected using sequential search is age, balance: a two-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_card, age, balance, estimated_salary: a four-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary: a five-variable model with a Cp value of 7.

The best models selected using the feature selection methods are:

1. Age, balance (2-variable model)

  • Exhaustive search: Adj R^2

  • Forward search: Adj R^2

  • Backward search: Adj R^2

  • Sequential search: Adj R^2

2. Credit_score, age, balance (3-variable model)

  • Exhaustive search: BIC

  • Forward search: BIC

  • Backward search: BIC

  • Sequential search: BIC

3. Credit_score, age, tenure, balance, estimated_salary (5-variable model)

  • Exhaustive search: Cp

  • Forward search: Cp

  • Backward search: Cp

  • Sequential search: Cp

4. Credit_score, country, gender, age, tenure, balance, estimated_salary (7-variable model)

  • Mallow’s Cp plot

5. Credit_score, country, gender, age, tenure, balance, active_member, estimated_salary (8-variable model)

  • Mallow’s Cp plot
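The searches above are commonly run with the leaps package; a sketch under the assumption that the numerically encoded data frame (including the churn column) is named churn_data_num:

```r
# Best-subset feature selection with leaps::regsubsets.
library(leaps)

fit <- regsubsets(churn ~ ., data = churn_data_num, nvmax = 10,
                  method = "exhaustive")  # also "forward", "backward", "seqrep"
fs <- summary(fit)

# Model size preferred by each criterion.
which.max(fs$adjr2)  # adjusted R^2: maximize
which.min(fs$bic)    # BIC: minimize
which.min(fs$cp)     # Mallow's Cp: minimize

# Coefficients of, e.g., the best 3-variable model.
coef(fit, 3)
```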

Chapter 6: Comparing the Models Chosen in Feature Selection

Methods of Model Evaluation

When evaluating the models, we focus particularly on false negatives, because it is vital for any bank to correctly predict churn, ideally without errors. False negatives are of highest importance because we do not want to incorrectly predict that customers are retained when they have in fact churned.

The second way we evaluate the models is by the area under the ROC curve (AUC): the higher this number, the better the model.
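Both evaluation steps can be sketched as follows, assuming a fitted glm object named model and the data frame churn_data_logit used below:

```r
# Confusion matrix and AUC for a fitted logistic regression model.
library(pROC)

probs <- predict(model, type = "response")  # predicted churn probabilities
pred  <- ifelse(probs > 0.5, 1, 0)          # classify at a 0.5 threshold

# Rows = actual, columns = predicted; false negatives are the
# (Actual 1, Predicted 0) cell -- the costliest error for the bank.
table(Actual = churn_data_logit$churn, Predicted = pred)

# Area under the ROC curve: higher is better.
auc(roc(churn_data_logit$churn, probs))
```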

Model with all variables:

## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.705  -0.959  -0.663   1.089   2.206  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.97e+00   1.97e-01  -15.04  < 2e-16 ***
## credit_score     -6.47e-04   2.22e-04   -2.92  0.00354 ** 
## country          -2.39e-02   2.62e-02   -0.91  0.36143    
## gender           -4.81e-02   4.35e-02   -1.11  0.26867    
## age               7.10e-02   2.27e-03   31.31  < 2e-16 ***
## tenure           -2.17e-04   7.49e-03   -0.03  0.97688    
## balance           4.53e-06   3.62e-07   12.51  < 2e-16 ***
## products_number  -1.15e-01   3.34e-02   -3.43  0.00061 ***
## credit_card      -2.26e-02   4.75e-02   -0.48  0.63478    
## active_member     7.66e-02   4.34e-02    1.77  0.07744 .  
## estimated_salary  6.48e-07   3.79e-07    1.71  0.08750 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13722  on 9999  degrees of freedom
## Residual deviance: 12297  on 9989  degrees of freedom
## AIC: 12319
## 
## Number of Fisher Scoring iterations: 4

Model 1 - churn ~ age + balance

Looking at the adjusted R^2 results, we see a maximum value of 0.14 with two variables, age and balance, for the exhaustive search method. Forward, backward, and sequential search give the same adjusted R^2 of 0.14 with the same two variables. Therefore, we build Model 1 with the parameters age and balance.

## 
## Call:
## glm(formula = churn ~ age + balance, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.275  -0.379  -0.193   0.462   0.991  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.68e-01   1.89e-02   -14.2   <2e-16 ***
## age          1.54e-02   4.33e-04    35.5   <2e-16 ***
## balance      1.03e-06   7.52e-08    13.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.214)
## 
##     Null deviance: 2465.0  on 9999  degrees of freedom
## Residual deviance: 2140.8  on 9997  degrees of freedom
## AIC: 12973
## 
## Number of Fisher Scoring iterations: 2
Confusion matrix from Logit Model-1

           Predicted 0   Predicted 1   Total
Actual 0          4557          1035    5592
Actual 1          2089          2319    4408
Total             6646          3354   10000
## Area under the curve: 0.734

Model 2 - churn ~ credit_card + age + balance

Similarly, for the second model, looking at the BIC values from feature selection, we have a minimum value of -1500 from the exhaustive search method with three variables: credit_score, age, and balance. The same variables appear in both forward and backward search; sequential search adds estimated_salary to the other three. Taking the variables common to all these search techniques leaves credit_score, age, and balance, so we build Model 2 with these variables.

## 
## Call:
## glm(formula = churn ~ credit_card + age + balance, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.273  -0.380  -0.194   0.461   0.993  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.64e-01   2.03e-02  -13.05   <2e-16 ***
## credit_card -4.87e-03   1.02e-02   -0.48     0.63    
## age          1.54e-02   4.33e-04   35.46   <2e-16 ***
## balance      1.03e-06   7.52e-08   13.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.214)
## 
##     Null deviance: 2465.0  on 9999  degrees of freedom
## Residual deviance: 2140.8  on 9996  degrees of freedom
## AIC: 12975
## 
## Number of Fisher Scoring iterations: 2
Confusion matrix from Logit Model-2

           Predicted 0   Predicted 1   Total
Actual 0          4571          1021    5592
Actual 1          2100          2308    4408
Total             6671          3329   10000
## Area under the curve: 0.734

Model 3 - churn ~ credit_score + age + tenure + balance + estimated_salary

For the third model we repeat the same procedure using the Cp metric. We have a minimum Cp value of 7 with the variables credit_score, age, tenure, balance, and estimated_salary for the exhaustive search method, and the same variables are associated with Cp for forward, backward, and sequential search. Therefore, we build Model 3 with the features credit_score, age, tenure, balance, and estimated_salary.

## 
## Call:
## glm(formula = churn ~ credit_score + age + tenure + balance + 
##     estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.270  -0.382  -0.195   0.461   1.002  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.91e-01   3.78e-02   -5.07  4.1e-07 ***
## credit_score     -1.39e-04   4.74e-05   -2.94   0.0033 ** 
## age               1.54e-02   4.33e-04   35.50  < 2e-16 ***
## tenure           -1.70e-04   1.60e-03   -0.11   0.9153    
## balance           1.03e-06   7.52e-08   13.68  < 2e-16 ***
## estimated_salary  1.39e-07   8.09e-08    1.72   0.0861 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.214)
## 
##     Null deviance: 2465.0  on 9999  degrees of freedom
## Residual deviance: 2138.3  on 9994  degrees of freedom
## AIC: 12967
## 
## Number of Fisher Scoring iterations: 2
Confusion matrix from Logit Model-3

           Predicted 0   Predicted 1   Total
Actual 0          4566          1026    5592
Actual 1          2072          2336    4408
Total             6638          3362   10000
## Area under the curve: 0.734

Model 4 - churn ~ credit_score + country + gender + age + tenure + balance + estimated_salary

For Model 4, we look at the Mallow’s Cp plot to understand the feature selection for this model. At a Cp value of 7, two models overlap, so we cannot quite determine which model to choose. The next best Mallow’s Cp value is 7.5, with the features credit_score, country, gender, age, tenure, balance, and estimated_salary. We select these features for Model 4.

## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.278  -0.381  -0.196   0.461   1.009  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.83e-01   3.82e-02   -4.78  1.8e-06 ***
## credit_score     -1.39e-04   4.74e-05   -2.94   0.0033 ** 
## country          -4.69e-03   5.59e-03   -0.84   0.4016    
## gender           -1.09e-02   9.29e-03   -1.17   0.2409    
## age               1.54e-02   4.33e-04   35.50  < 2e-16 ***
## tenure           -1.80e-04   1.60e-03   -0.11   0.9102    
## balance           1.03e-06   7.52e-08   13.67  < 2e-16 ***
## estimated_salary  1.39e-07   8.09e-08    1.71   0.0865 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.214)
## 
##     Null deviance: 2465.0  on 9999  degrees of freedom
## Residual deviance: 2137.9  on 9992  degrees of freedom
## AIC: 12969
## 
## Number of Fisher Scoring iterations: 2
Confusion matrix from Logit Model-4

           Predicted 0   Predicted 1   Total
Actual 0          4567          1025    5592
Actual 1          2056          2352    4408
Total             6623          3377   10000
## Area under the curve: 0.734

Model 5 - churn ~ credit_score+country+gender+age+tenure+balance+active_member+estimated_salary

## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + active_member + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.271  -0.381  -0.196   0.460   1.008  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.92e-01   3.86e-02   -4.98  6.6e-07 ***
## credit_score     -1.39e-04   4.74e-05   -2.93   0.0034 ** 
## country          -4.76e-03   5.59e-03   -0.85   0.3948    
## gender           -1.05e-02   9.29e-03   -1.13   0.2576    
## age               1.54e-02   4.33e-04   35.52  < 2e-16 ***
## tenure           -1.60e-04   1.60e-03   -0.10   0.9204    
## balance           1.03e-06   7.52e-08   13.66  < 2e-16 ***
## active_member     1.66e-02   9.26e-03    1.79   0.0733 .  
## estimated_salary  1.38e-07   8.09e-08    1.70   0.0882 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.214)
## 
##     Null deviance: 2465.0  on 9999  degrees of freedom
## Residual deviance: 2137.2  on 9991  degrees of freedom
## AIC: 12968
## 
## Number of Fisher Scoring iterations: 2
Confusion matrix from Logit Model-5

           Predicted 0   Predicted 1   Total
Actual 0          4577          1015    5592
Actual 1          2062          2346    4408
Total             6639          3361   10000
## Area under the curve: 0.734

Best Model

Model 5 - credit_score, country, gender, age, tenure, balance, active_member, estimated_salary

Chapter 6: Summary

When creating a model based solely on customer demographics, only the age variable is significant. On the other hand, in the model based on salary and account features (number of products, balance, active status), the products_number and balance variables are significant.

In our analysis, roughly 90% of the variance is explained by 9 principal components, with a cumulative proportion of variance of 0.92 at PC9.

Best models by criterion: Adj R^2: age + balance; BIC: credit_score + age + balance; Cp: credit_score + age + tenure + balance + estimated_salary.

Best overall predictor of customer churn: Model 5, which includes the features credit_score + country + gender + age + tenure + balance + active_member + estimated_salary, performs best, with comparatively lower false negatives (1,994) and a higher area under the curve (0.745). We prioritize false negatives because we can accept the model predicting churn for a customer who is actually retained; however, if a customer has actually churned and our model predicts retention, that is a problem, since it is critical for the bank to correctly identify churned customers and flag these classification errors. Therefore, the verdict on the best model is decided by which model has the lower number of false negatives.

Chapter 7: Bibliography

Kaemingk, D. (2018, August 29). Reducing customer churn for banks and financial institutions. Qualtrics. Retrieved November 2, 2022, from https://www.qualtrics.com/blog/customer-churn-banking/

Guliyev, H., & Yerdelen Tatoğlu, F. (2021). Customer churn analysis in banking sector: Evidence from explainable machine learning models. Journal of Applied Microeconometrics, 1(2), 85–99. https://doi.org/10.53753/jame.1.2.03